Transcriptional signatures and machine learning provide a framework for better understanding estrogen receptor positive breast cancer
Abstract
Estrogen receptor (ER) is a key driver of ER positive breast cancer (BC) and targeted by endocrine therapies. Among ER+ BC patients, one half of all recurrences happen after five years of endocrine therapy. In this paper we show that among ER+ BC patients that received only endocrine therapy, molecular estrogen signaling signatures are prognostic on TCGA, METABRIC and SCANB (over 10000 patients in total). Even among tumors with ER IHC percentage greater than 90%, estrogen signatures are prognostic, showing a discrepancy between ER IHC and molecular data. In order to better understand the molecular biology of these tumors, we developed a new pipeline using unsupervised machine learning algorithms to integrate and interpret publicly available datasets in a single sample manner, such that patient samples can be used regardless of the cohort of origin and sequencing platform. To validate the pipeline, we used SCANB (RNAseq), normal mammary gland samples (RNAseq), patient derived xenografts (PDX, RNASeq) and data from the POETIC trial (microarray, ER+ only patients). We show that the molecular landscape, where samples are integrated and embedded together, captures the different PAM50 molecular subtypes and does not depend on clinical variables, such as age, tumor stage and tumor purity. Moreover, all patients from the POETIC trial whose embeddings in the molecular landscape are on the basal region were non-responders. We further show using the POETIC data that looking at the neighborhoods in the molecular landscape of a responder and non-responder we see differences in estrogen signaling, highlighting potential differences in the treatment. PI3K AKT MTOR signaling and G2M checkpoint are two pathways that are drivers of the molecular landscape and we showed that they are also are prognostic for RFS on SCANB and METABRIC, showing possible alternative treatments with drugs currently in the market for early stage BC patients. Finally, we derive a new risk score based on sample position in the molecular landscape and on clinical factors that outperforms commercially available signatures and provides additional information when used in combination with the risk of recurrence (ROR). Thus, we show how the molecular landscape based on transcriptomics data could help in better understanding the tumor biology.
Introduction
Breast cancer is the most diagnosed cancer worldwide (1). It is a very heterogeneous disease that can be subdivided in different histological and molecular subtypes. Patient tumors are classified based on the expression as detected by IHC of the estrogen, progesterone receptors and HER2 membrane protein. The estrogen receptor positive (ER+) breast cancer subtype represents more than 70% of all cases (2). Tumor cells of this subtype contain the estrogen receptor whose main mechanism is the target of selective estrogen receptor modulator (SERM), selective estrogen receptor degrader (SERD) and aromatase inhibitors (AI). To receive endocrine therapy, tumors need to be ER+, which according to the ASCO guidelines are defined as having at least 1% of the tumor cells expressing estrogen receptor (3), whereas the Swedish national guidelines uses a threshold of 10% for ER status (4). It has been shown that the patients in the low spectrum of ER positivity, from 1% to 10%, do not benefit as much as patients with 10% to 100% of ER positive cells in the tumor (5).
Several gene signatures already are being used, such as OncotypeDX, Mammaprint and Prosigna to assign chemotherapy for those patients with higher risk of recurrence (6–8). This score might be associated with the commercial signatures,
Transcriptomics and genomics have revolutionized cancer research. The new tools opened the option to better understand the molecular underpinnings of cancer biology. In breast cancer, molecular subtyping through the use of next generation sequencing technologies and qRT-PCR have been used to aid clinicians when taking the decision of giving chemotherapy for patients, avoiding potential overtreatment of patients. Researchers have developed gene signatures that are able to assign a risk score. Such score is based on survival data found in the literature and the higher the risk score of a patient, the more likely the recurrence, therefore the use of additional chemotherapy in first-line therapy is advised. Some of the risk scores are already recommended in international guidelines (10) which are already being used in the clinics (6–10). They are all based in small sets of genes that were obtained after several statistical procedures, usually using forward variable selection on cox regression. However, these signatures do not provide a complete explanation on why a patient should receive additional therapy, they are mostly prognostic. What is known is that they are composed of several submodules pathways, such as estrogen, proliferation and HER2 signaling and that some signatures are correlated to some of its submodules. For example, the recurrence score (RS) is associated with the estrogen module and the Risk of Recurrence (ROR) signature is correlated with its proliferation module (11). An alternative to the signatures only approach is to use RNA-sequencing in the clinics (12). The SCANB consortium showed that it is possible to use mutational and gene expression based biomarkers within one week of tumor surgery (13–15) in the clinics. They also showed that it is possible to use RNA-seq to calculate the Prosigna risk scores and risk stratification (16), showing the potential all in one use of RNA-seq instead of using different assays from different companies.
Even though there are several new analytical techniques to understand the risk of recurrence for ER+ BC patients, there is still a lack of understanding of the patient’s tumor molecular biology. There are challenges to overcome in the pathway analysis for patients individually and how to compare patients molecularly when considering complete transcriptomics data. There is no tool that allows to integrate patients in a single sample way and that does not require large amounts of data. In most cases integration is a one step procedure and cannot be updated. The common tools used in the RNA-seq community for batch effect removal are (17–19) and new samples cannot be integrated in a straightforward manner. The only way is to re-run the procedure together with the new sample. Such a procedure is not usually feasible as there are not enough samples to estimate batch effects across groups and therefore correct them.
We propose here a new pipeline for integrating publicly available datasets and developing a framework for personalized medicine and the understanding of molecular pathways at the patient level. By developing a normalization and embedding technique, we show that it is possible to integrate publicly available molecular datasets, such as TCGA, SCANB, METABRIC, microarray data of some patients from the POETIC trial, patient derived xenografts and healthy breast tissue (12,20–23) into a common space, called molecular landscape. By using a special normalization and principal component analysis we can combine the datasets together explore the full molecular landscape of RNA-sequencing and microarray samples together. The new molecular landscape allows us to add new one single sample independently, showing its molecular subtype without the need of any other classification tool that has been validated in another dataset or centering techniques. It is possible to compare patients in a small neighborhood of the molecular landscape, providing a context to better understand the biology of the individual tumor. Moreover, we propose the use of different gene sets from the hallmarks collection of the Molecular Signature Database (MSigDB) (24,25), enabling understanding of individual patient tumor biology.
Alternative treatments are necessary for early stage ER+ BC patients. Currently chemotherapy and immunotherapy ared used as a second line therapy for these patients. Here we show that G2M checkpoint and PI3K AKT MTOR signaling pathways are important drivers of the molecular landscape and are predictive in METABRIC and SCANB among patients that received only endocrine therapy. Drugs are available for these pathways, but they are only approved for metastatic BC. Moreover, Novartis release a statement revealing the results of an interim analysis of the phase III NATALEE trial, showing the benefit of a CDK4/6 inhibitor (ribociclib) in combination with ET compared to ET only for early stage ER+ BC patients (26). Our data here further supports the results of this interim analysis.
To conclude the framework we also developed a risk score signature that depends not on a specific subset of genes but on the position of the molecular landscape and clinical factors. We show that when used in combination with the ROR score it is still statistically significant and provides additional information.
A R package is provided on github (github.com/chronchi/molecular_landscape) with instructions on how to use the newly created molecular landscape and how to integrate new samples and calculate the risk scores.
Methods
Cohorts
Publicly available breast cancer cohorts TCGA, SCANB, METABRIC and POETIC (12,20–22) were used to calculate the principal component analysis (PCA) embeddings and validation. From TCGA, SCANB and METABRIC, only primary breast cancer samples were included. Each cohort has over 1000, 1000 and 8000 primary samples respectively, totalling over 10000 samples. The POETIC trial comprises of only post-menopausal women diagnosed with primary breast cancer and considered to have an estrogen receptor positive status (IHC) in the initial diagnosis.
TCGA was downloaded from Firebrowse. The version 3 of SCANB data (StringTie FPKM Gene Data unadjusted) was downloaded from Mendeley https://data.mendeley.com/datasets/yzxtxn4nmd (27). METABRIC was downloaded directly from cBioPortal. POETIC data was downloaded from GEO (accession code GSE105777) using the R package GEOquery. For details about downloading and preprocessing steps see https://chronchi.github.io/transcriptomics.
Selection of highly variable genes
In order to perform the PCA, we sub selected 1000 common and highly variable genes in the TCGA and METABRIC cohorts. For each gene and in each cohort separately, the coefficient of variation (CV) was calculated:
\[ \text{CV} = \frac{\text{standard deviation}}{\text{average expression level}}. \]
The 1000 genes with highest average CV were selected.
qPCR-like normalization
Given a list of 44 stable genes across different cancers (28) and 1000 genes selected, all 1044 genes were ranked from lowest to highest expression for each sample separately and the rankings were divided by the average ranking of the 44 stable genes.
PCA embedding
Using the normalized data as described in the qPCR-like normalization section, a total of 1000 random samples (fixed seed in R) from TCGA and METABRIC were selected to perform the initial PCA, using PCAtools (29) in R. PCA was performed without centering and scaling since the data is already centered and scaled for all genes and samples. The embedding for new individual samples is obtained by multiplying the loadings matrix with the sample normalized expression. For samples where not all the 1044 genes are available, normalization was performed and the missing genes were padded with 0.
Scoring strategies
For the 4 big cohorts, TCGA, SCANB, METABRIC and POETIC, GSVA (30) was applied along with the \(SET_{ER/PR}\) signature (31) and the hallmark collection from the molecular signature database (24,32). Default parameters were used in the gsva function from the GSVA package.
Average neighborhood scores
In order to calculate the posterior distribution of the average scores in each neighborhood, a linear regression with only intercept (score ~ 1) was fitted using rstanarm (33) and the function stan_glm for each pathway individually. When applying the function stan_glm, we used four chains and a prior normal distribution with location \(0\) and scale equals to \(1\). The package tidybayes (34) was used to extract the draws and put them in a tidy format.
Survival analysis
Survival analysis was done using cox regression with the survival package from R. Variables used for adjustment were age, tumor stage and number of positive lymph nodes for TCGA and SCANB, age and the Nottingham Prognostic Index (NPI) for METABRIC. Overall survival was performed for all three cohorts and recurrence free survival for METABRIC and SCANB. It could be that there are unmeasured confounders that are affecting the pathways and outcomes at the same time, leading to a biased result.
Risk score
We used cox regression on data of ER+ breast cancer positive patients that received only endocrine therapy. A subset of samples from METABRIC was used as a training cohort. The rest of the samples were used as a test to evaluate the prognostic value of the signature, SCANB was used as a validation cohort. The cox regression was calculated with the following covariates: PC3, PC4, tumor size, node status (negative if no node, positive otherwise) and age. Risk score was calculated as the sum of the coefficients multiplied by the respective values for each patient.
\[ \begin{aligned} -0.09 \times & \text{PC3} + 0.08 \times \text{PC4} + 0.02 \times \text{Tumor size} \\+ & 0.58 \times \text{Node status} + 0.003 \times \text{Age} \end{aligned} \]
In order to compare the risk score and its clinical utility we performed a likelihood ratio test with the full model (risk score, binary category coming from ROR, tumor size, age and node status) and the base model (without the risk score).
A risk score using only the principal components was calculated and the formula is provided below.
\[ -0.09 \times \text{PC3} + 0.09 \times \text{PC4} \]
Code availability
The code used to generate all the analysis is available on https://github.com/chronchi/molecular_landscape. To fully reproduce the images in this paper check the instructions on how to run the docker image and RStudio session at the github repository. An online version with a website containing all the analysis and code can be found on https://chronchi.github.io/molecular_landscape. To check the code in the previous link click in source code on each chapter separately.
Results
Estrogen receptor signaling is a clinical continuous variable
Currently ER status is used for the decision to administer endocrine therapy and it is considered binary with cutoff for ER+ defined as >1 to >10% of tumor cells expressing the ER by IHC. We hypothesized that a more functional readout as estrogen receptor (ER) signaling activity determined by scores from bulk RNA-seq samples is a more adequate measure among patients that received only endocrine therapy. To test this hypothesis we used the estrogen signatures HALLMARK_ESTROGEN_RESPONSE_EARLY and HALLMARK_ESTROGEN_RESPONSE_LATE that were extracted from the molecular signature database (24) (MSigDB) and \(SET_{ER/PR}\) (31). The signatures from MSigDB each contain 200 genes and the latter signature has 18 genes that are associated with estrogen and progesterone receptor signaling and applied them to three independent breast cancer RNA based datasets, TCGA, SCANB and METABRIC (16,20,21) to calculate estrogen signaling scores. Each of these cohorts comprises more than 1000 untreated patient tumors. The individual scores for each patient sample were ranging from -1 to 1 in each of the three cohorts ( Figure 1 (a)) stratified by estrogen receptor status. There is a difference in the scores between the estrogen receptor (ER) positive and negative subgroups for each cohort separately. The ER- has mostly negative ER signaling scores. On the other hand, the ER+ group has scores that range from negative to positive, covering the whole spectrum of possible scores (Figure 1 (a)).
To assess how the functional ER score related to ER IHC index, we used the SCANB dataset to calculate the correlation between the scores. There is a small correlation between the molecular ER signaling score and ER percentage (Figure 1 (b)). Among those patients with high ER IHC index (over 90%), the distribution of ER signaling scores is all over the spectrum, indicating a mismatch between IHC and the functional form of the ER signaling score (Figure 1 (c)). ¨ Next we wanted to evaluate the predictive power of the molecular ER signaling scores, given that there is a mismatch between ER IHC percentage and molecular ER signaling score based on the SCANB dataset. Cox regression was used to determine the hazard ratio (HR) of the ER signaling in overall survival (OS) for TCGA, SCANB and METABRIC and recurrence free survival (RFS) for METABRIC and SCANB. Each survival analysis was done independently and adjusted for available clinical variables. Tumor size and number of lymph nodes were used for TCGA and SCANB cohorts. The Nottingham prognostic index (NPI) was used for METABRIC. Age was used in all cohorts as a clinical variable for adjustment. To better understand the effect of ER signaling in the clinics, samples from ER+ BC patients that received only endocrine therapy (for SCANB and METABRIC), or samples with the luminal A and B PAM50 molecular subtype (for TCGA). Such sampling may avoid a bias on the coefficients due to chemotherapy, as patients who receive chemotherapy tend to have a worse outcome compared to those that do not. In all the three cohorts, the hazard ratio for estrogen early (Figure 1 (d)) was below 1 (TCGA: 0.23, 0.05-0.99 CI; SCANB: 0.22, 0.10-0.45 CI; METABRIC: 0.56, 0.34-0.89 CI). This shows the continuous aspect of the functional ER signaling. For both hallmark estrogen response late and SET ER/PR, results were similar to estrogen response early (Table S1). In this case considering only the ER+ BC patients that received only ET we could argue that these signatures are both prognostic and preditictive of ET response.
Given that the ER IHC index could have an impact in the previous results, we only selected patients with high ER IHC index (> 90%) and re-performed survival analysis on the SCANB cohort (Figure 1 (e)). The HR for ER percentage is 1.01 (CI 0.88-1.03). But even among those patients the two estrogen signatures have a HR below 1 (OS: 0.26, 0.10-0.67; RFS: 0.48, 0.03-1.02). This illustrates that even among patients with high ER IHC index, there are those with low ER signaling scores indicating a poor sensitivity to ET.
Single sample integration preserves relevant breast cancer properties
Given that not all patients with high ER IHC percentage will respond the same to endocrine therapy, we aimed to better identify the non-responders based on molecular pathways and transcriptional signatures for a single patient. To this aim, we developed a single sample batch effect removal method to integrate microarray and bulk RNA-seq and create a molecular landscape, which is an embedding of the molecular data into a common space for all samples. The advantage of the method is when given a new sample, it can easily be integrated with all the other previous samples without retraining, therefore being possible to integrate in the clinical practice. The first step of the method is to normalize the samples to the same scale. The average ranking of housekeeping genes distribution is similar among the two RNA-seq datasets, SCAN-B and TCGA and their mean values are higher than the mean of the distribution on METABRIC (Figure S1 (a)). The normalization preserves the distinctions of gene expression levels between ER+ and ER- breast cancer samples, ESR1 and TFF1 being such examples (Figure S1 (b)).
The biplot in Figure 2 (a) with the third and fourth components from TCGA and METABRIC samples shows that the samples are overlapping across the two cohorts. All samples, including those used for training and validation, are plotted.
In order to check the robustness of the procedure, the analysis was repeated 10 times with 10 random subsets of patient samples from TCGA and METABRIC, simulating a cross validation process. Figure 2 (b) shows the embedding of some of the validation datasets. The blue dots correspond to the original embedding of the samples, the red dots to the new embedding and the lines are connecting the same sample. Whenever a new embedding is performed, all samples are either shifted in the same direction or they are reflected along a common axis, showing the invariance properties of the embedding. (Figure 2 (b)).
Missing genes are a problem inherent to publicly available datasets. To test the impact of the missing genes in the embedding, genes were ranked based on their loadings of the third and fourth component and then we removed 200 genes in total with a varying proportion of the top loading genes (ranging from 0 to 100% with a 5% step). The higher the proportion the less precise the embedding is, with the embedding converging towards the origin, i.e., the (0,0) coordinates (Figure 2 (c)). If the proportion of top genes missing are below 30% (60 genes) the embedding is not heavily impacted (Figure 2 (c)).
The third component corresponds to the separation between ER+ and ER- BC patients in both cohorts (Figure 2 (d)). A combination of the third and fourth components shows a good distinction among the PAM50 molecular subtypes (Figure 2 (e)), showing that the embedding captures important molecular information. The fourth component mostly divides the luminal A and luminal B subtypes, whereas the normal-like subtype is spread across the third and fourth component. Since the normal samples are both close to the luminal A and basal subtypes in terms of the euclidean distance, one should only compare samples that are close to each other for pathway analysis.
Figure 2 (f) shows a gradient of the ER signaling score \(SET_{ER/PR}\). The higher values are on the far right of the third component, decreasing in value when moving from right to left on the third component. Other clinical factors, such as tumor stage, node stage, age, NPI and tumor purity show no influence in the embedding (Figure S2).
Embedding generalizes to a validation cohort
So far only samples from TCGA and METABRIC were used in the training and projection. SCANB was used as an external validation cohort. Similar to the previous results, by inspecting the third and fourth components SCANB overlapped with METABRIC and TCGA (Figure 3 (a)). ER+ and ER- BC patients are well separated (Figure 3 (b)) and the procedure can also distinguish the molecular subtypes (Figure 3 (c)) on top of the other samples coming from TCGA and METABRIC. We show that the embedding works for the SMC cohort (35) (Figure S3). As an RNA-seq cohort, it is expected that SCANB samples will be closer to TCGA than to METABRIC when removing batch effects, mostly due to platform. Biplot of PC1 and PC2 (Figure 3 (d)) shows that SCANB is closer to TCGA than to METABRIC.
Together, these findings suggested that our approach is robust. Next we also the molecular landscape with patient derived xenografts (PDX) (23). These PDXs are breast cancer cells derived from patient tumor samples obtained from clinics and injected into the mammary gland of mice (36). By using the MIND model, we can engraft cells coming from ER+ BC tumors. Moreover, to have a succesfull engraftment rate, the cells need to grow so they can establish across the ducts. Therefore, the cells when extracted for RNA-sequencing are usually in a proliferative state. Figure 3 (e) shows the embedding of six different PDX samples with multiple biological replicates (23). These samples are in the scattered around the luminal B region, showing that the landscape captures the biology of the experiment and it shows the intertumor heterogeneity among ER+ BC patients in a research environment. Moreover, we used 66 samples coming from women that performed reduction mammoplasty in Switzerland. Figure 3 (f) shows that these samples are in the region of the normal-like PAM50 molecular subtype, as expected.
Molecular landscape is a tool to understand and explore patient heterogeneity
Since the molecular landscape relies in a single sample to obtain the embedding, we can use samples from any other cohort and compare to clinical factors. To provide a use of the molecular landscape in the clinical practice, we used the POETIC trial microarray data. The POETIC trial (37) was a trial that evaluated the use of perioperative aromatase inhibitors in ER+, postmenopausal BC patients. Its primary endpoint was time to recurrence. Matched samples from treated and untreated patients at baseline (before treatment) and at surgery (after an average 14 days of treatment) (22) were sent for microarray hybridization. The patients have matched Ki67 percentage levels, which can be considered an indication of how well a patient responded to the endocrine therapy. Patients with more than 5% of baseline Ki67 and a reduction of 60% upon endocrine therapy are considered responders, otherwise they are defined as non responders.
In order to gain insights on the differences between responders and non responders, we embedded the POETIC trial samples using the procedure and studied molecular landscape (Figure 4 (a) left). The samples are spread across the whole molecular landscape, consistent with patients having different molecular biological properties. Furthermore, the POETIC samples are embedded closer to the METABRIC samples (Fig S4). Given the available information, the BC patients that are ER+ and in the left part of the landscape (ER- patients), are all non responders (Figure 4 (a) right). Moreover, some of the non responder patients are actually in the basal cluster (Figure 4 (c)). We selected two samples, from patients that were responder and non responder and are close in the molecular landscape (Figure 4 (c)) to highlight their molecular differences and see what is in their context. Figure 4 (d) shows the average posterior distribution of the neighborhood for the responder patient. The responder patient has a ER signaling score higher than the average. On the other hand, the non responder has a smaller ER signaling score than the average (Figure 4 (e)) and also a higher androgen signaling score (Androgen response). Other pathways, such as EMT, E2F targets, P53 and TGF\(\beta\) signaling along with their average posterior distributions are shown for both patients.
Generating hypothesis: subgroups and alternative treatments
A key aspect of precision medicine is finding the right patients for targeting with the proper drug. With the molecular landscape we can try to identify different subgroups, based on molecular pathway scores. We identify several pathways driving the distinction between samples in the embedding, such as G2M checkpoint, epithelial to mesenchymal transition, DNA repair and PI3K AKT MTOR signaling (Figure 5 (a)). The scores go from one direction to other across the two different principal components, indicating that they complement each other when capturing information. Not only G2M checkpoint is differentiating luminal A and B patients, EMT also is.
The overall and recurrence free survival calculated from patients that have ER+ BC, received only endocrine therapy (METABRIC: OS/RFS and SCANB: OS/RFS) and have \(PC3 > 0\) show that G2M checkpoint has a hazard ratio higher than 1 (Figure 5 (b); Table S2) and a tight confidence interval, which reflects the fact the higher the proliferation the worse the outcome, and reflects the subtyping differences between luminal A and B. The PI3K AKT MTOR signaling has a hazard ratio higher than 1 with a wider confidence interval, indicating a possible subgroup among all selected patients that may benefit from additional adjuvant therapy targeting this specific pathway.
Development of a new risk score and validation
We hypothesized that the molecular landscape contains clinically relevant information and that it could either complement or outperform the use of commercially available gene signatures. We used the third and fourth components separately in a cox regression, adjusted by the clinical variables mentioned in the methods section. In both overall survival and recurrence free survival analysis, the PC3 had a hazard ratio below 1 for the METABRIC, SCANB and SCANB high ER IHC percentage (more than 90%) BC patients that received only endocrine therapy. On the other hand, PC4 had hazard ratios higher than 1 in all cohorts for both OS and RFS (Figure 6 (a)). To further validate the clinical aspect of the molecular landscape, we calculated the euclidean distance between the surgery and baseline matched samples of the POETIC trial data. Consistent with our hypothesis that the position of in the molecular landscape is clinically relevant, the average distance of the embedding position is higher in responders than non-responders (Figure 6 (b)).
To further evaluate the clinical importance of the positions in the molecular landscape, we developed a new risk score based on clinical variables (tumor size, age and node status) and the position in the molecular landscape (PC3 and PC4 scores). Table 1 shows the results of the recurrence free survival analysis performed using these variables as coefficients and ER+ BC patients treated with ET only. The risk score developed follows a gradient from the bottom right of the molecular landscape to the upper left of the embedding (Figure 6 (c)) among all the patients treated with endocrine therapy only and that have ER+ BC.
To provide aditional benefit based on the scores in the To further validate the clinical utility of the score, we compared the new developed risk score with the risk of recurrence score available from the SCANB patients calculated by the nearest centroid technique (16). We see that there is a positive correlation among the scores, moreover there is a population of luminal A patients that present a high risk score that is not captured by the ROR, as by definition luminal A patients are of low/intermediate risk. We further showed that by using the binary categorization of the ROR into high and low/intermediate risk groups (16) there is additional benefit in using the new risk score in a survival analysis (p-value = 0.02). Figure 6 (e) shows that there is a distinction in the distributions of the low/intermediate and high risk patients, but even among the low/intermediate there are those with risk scores similar to those of high risk, which are by majority luminal B. As another way of validating the risk score, we calculated the risk scores using only the principal components on the POETIC trial data. Responders have a decrease in average in the risk score. Non-responders have almost the same risk score (Figure 6 (f)).
Discussion
Personalized medicine is a key topic in medicine and BC (23). The goal of better understanding the molecular underpinnings of the diseases leads to a better allocation of treatments and resources in the patient care. This has been shown to be necessary by using PDX mouse models, where different PDXs respond differently to hormone treatments (23). We have shown here a possible framework to deal with personalized medicine in breast cancer in general with a focus on ER+ BC patients.
Here we show using cox regression and multiple cohorts (12,20,21) with both RNA-seq and microarray data, that the function ER signaling score is a continuous predictive marker. Among the high ER IHC index patients (>90%), the ER score is still a predictive maker, highlighting the added benefit of using transcriptomic assays to understand the biology of the tumors. This can aid when evaluating treatment options and avoiding overtreatment. The functional ER scores presented here might be associated with the commercial signatures, as it has been shown that OncotypeDX’s estrogen module is highly correlated with the general OncotypeDX signature (11).
Integrating molecular data stemming from different sources is a challenge. On one hand batch effect tools are usually able to remove the batch effects across the different sources of variability (17,18), on the other hand they are not single sample based, meaning each time a new sample comes the algorithm needs to be run again. It is also based on the fact one has enough data in the different datasets, otherwise it skews the possible integration towards one of the datasets. Here we show by using TCGA, METABRIC and SCANB that it is possible to integrate the samples from these cohorts in a single sample manner. The samples show good mixing when using test samples not seen during the training stage. The embeddings preserve key molecular features of breast cancer. PC3 is clearly driven by estrogen receptor signaling where from right to left there is a gradient in the ER scores, such as Estrogen early and \(SET_{ER/PR}\). On the other hand, PC4 is what makes a difference between the molecular subtypes luminal A and B, which in practice differ by proliferation status in terms of Ki67 levels (38).
Moreover, the embeddings in the validation set (SCANB) preserve the key features of breast cancer. Samples are projected by their different PAM50 molecular subtypes and there is a gradient of estrogen signaling pathway from ER+ BC towards ER- BC patients in all the three cohorts combined. The first two components reflect batch effects. Expectedly, SCANB projected closer to TCGA, since both datasets are RNA-seq. On the other hand, part of the POETIC trial was sequenced using microarray dataset (22,37), and we show that it is projected closer to METABRIC as expected according to the first two components. This highlights that the method can capture information from different technologies and remove such batch effects. Moreover all the samples were sent for sequencing in different contexts at different times and populations, showing the power of the embedding method in removing batch effects and capturing truly the biology of breast cancer.
Sometimes when dealing with publicly available datasets, not all of the genes are available due to ethical protocols (12), pre-processing or technical reasons. We showed how robust the projection is to missing genes with high loadings in the projection. If less than 30% of the top genes (ranked by the loadings of the third and fourth principal components) are missing, we can recover the position of the embedding if all the. The more genes that are missing, the closer the projection will be to the origin, i.e., the (0,0) coordinate in the embedding of the third and fourth components of the PCA.
To show the clinical validity of the projection, we used a subset of patients from the POETIC trial with microarray and clinical information (22). The samples projected as expected and surprisingly the samples that are considered to be non responders upon 2 weeks of aromatase inhibition are projected among the ER- BC patients. When looking further upon two different patients that have similar embedding but different response to endocrine therapy, we see that the responder has a higher value of estrogen signaling than the average. On the other hand, the non responder patient has a smaller estrogen signaling score than the average, which suggests a possible explanation for the difference in response. Moreover, estrogen and androgen receptor signaling have been shown to be tightly linked (39) and these two patients have different androgen signaling scores. The non responder has a higher score than the average compared to the responder, whose score is just the same as the average.
Complementing the personalized approach, the molecular landscape can be used as a starting point to understand breast cancer molecular subgroups in a more intuitive way using pathways. Using survival analysis we show that some of the pathways that are important for the embedding can be used for subgrouping and hypothesis generation. Current clinical trials (40) are evaluating the use of PI3K-AKT-MTOR inhibitors in advanced metastatic BC patients, here we show weak evidence on the molecular level that such treatment could benefit early stage primary BC patients.
A weakness of the proposed pipeline is that we rely on GSVA scores, which can be used and compared across different cohorts since they have a representation of all molecular subtypes. A way to circumvent this is by having a small library of RNAseq or microarray samples that are representative of the patient population. This way when scoring a new patient, the scores can be compared across different cohorts. There is still a cost barrier for using RNAseq dataset in the clinical setting, but efforts are being made across the industry to reduce the sequencing costs and make it more widespread. An example is Alithea genomics, a company aiming to provide large-scale RNA-sequencing by using Bulk RNA barcoding and sequencing (BRB-seq).
With the development of new therapies, the use of better tools to empower clinicians is necessary. We believe that RNA-seq is a very powerful technique that can be used in clinical practice. The SCANB team (12,16) has shown how feasible it is to have a pipeline set up for RNA-sequencing. We have shown here how all the data that was generated can be combined and used both for clinical practice and also for research purposes, highlighing the power of RNA-seq.
References
| Coefficient | Hazard Ratio | Std error | p-value | CI low | CI high |
|---|---|---|---|---|---|
| PC3 | 0.91 | 0.0273 | 7.8e-04 | 0.87 | 0.96 |
| PC4 | 1.08 | 0.0226 | 3.6e-04 | 1.04 | 1.13 |
| tumor_size | 1.02 | 0.0043 | 3.7e-07 | 1.01 | 1.03 |
| node_statuspos | 1.79 | 0.1476 | 7.9e-05 | 1.34 | 2.39 |
| age | 1.00 | 0.0070 | 6.5e-01 | 0.99 | 1.02 |